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^ Abstract 

J The most direct way to express arbitrary dependencies in datasets 

^ is to estimate the joint distribution and to apply afterwards the argmax- 

function to obtain the mode of the corresponding conditional distribu- 
tion. This method is in practice difficult, because it requires a global 
optimization of a complicated function, the joint distribution by fixed 
input variables. This article proposes a method for finding global 

T— I maxima if the joint distribution is modeled by a kernel density esti- 

mation. Some experiments show advantages and shortcomings of the 
resulting regression method in comparison to the standard Nadaraya- 
Watson regression technique, which approximates the optimum by the 

^ expectation value. 

O 1 Introduction 

> 



X 



Regression is a very important method in engineering and science for the 
estimation of the dependencies between two or more variables on the basis of 
^ some given sample points. The best known regression method is certainly the 

parametric regression technique after Legendre and Gauss, which minimizes 
the squared error between a model - often a polynom - and the samples. 

The least squares method is fast and well suitable for strongly linearly 
correlated data, but seldom appropriate for high- dimensional problems with 
difficult, unknown, and non-linear dependencies. For these problems, non- 
parametric regression techniques - like kernel or Nadaraya- Watson regression 
methods - are more suitable ( |Nadaraya| |1964| , |Watson| |1964| ). The first 
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step of Nadaraya- Watson regression is to estimate the unknown joint density 
distribution px, 1^(35 5 y) of the given sample data D = {{xi, y^), . . . , (a;„, ?/„)} 



by a kernel-density estimator Scott , 1992 . The resulting model distribution 



has in the most general case the form 

m 

Pxxi^i y) = ^(^i Mx - Xi)ilJi{y - y^) (1) 

i=l 

with m < n and YaLi = 1- Furthermore, the kernel functions 0j and ipi 
have to be normalized so that the integrals over all values for x and y are 
one. With this model, the conditional distribution pY,x{y\x) can be easily 
derived: 

PY\x[y\x) = , , . = -f-- 7 

Px{x) }px,Y{x,y)(ly 

y 

™ (2'] 
Y,a.i(t)i{x - Xi)i)i{y - y^) ^' 

1=1 



ai (pi{x - Xi) 

i=l 

This distribution represents the relative probabilities for realizations of y 
given a vector x. But for a regression, we do not need a probability distribu- 
tion, but a single vector. The most intuitive choice is, of course, the mode of 
the conditional distribution, that means the value y for which pY\x becomes 
maximal. For this case, the regression function f{x) takes the specific form 

y = f{x) = aigmax{pY\xiy\x)}. (3) 
y 

The difficulty is, however, that the maximization is not easy to calculate, 
because the expression ^ is highly non-linear. 

On the other hand, the expected value of pY\x{y\x) regarding y is easy 
to calculate: 

m 

Y «i yi 4>i{x - Xi) 

ypY\x{y\x) dy = . (4) 

Y^i (piix - Xi) 

i=l 

The idea of Nadaraya and Watson was to approximate expression ([s]) by 

m 

Y^i yi <pi{x - Xi) 
if ~ , (5) 

Y^i <pi{x - X:^ 

i=l 
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which is often sufficiently good. But there are also some potential problems. 

Figure [l] demonstrates this for two different joint distributions. Case (A) 
is uncritical and describes essentially a hyperbolic tangent function. Case 
(B), however, causes problems. Here, the variables x and y are non- linearly 
correlated too, but the underlying dependency cannot be described by a 
function. The difficulty becomes obvious when considering the conditional 
distributions Px\y{x\ — 0.6) and py|x(|/|3), which are no longer unimodal. A 
computation of the expected value would yield in both cases zero, which is 
far away from probable values for ?/ = — 0.6 or x = 3. 

In contrast to the Nadaraya- Watson method, the calculation of expres- 
sion ([3]) should not lead to problems. The argmax-calculation cannot resolve 
the ambiguousness, of course, due to the fact that two global maxima ex- 
ist, but it is better to return only one maximum than a completely incor- 
rect value. Especially for high- dimensional tasks, this shortcoming of the 
Nadaraya- Watson method can be annoying, because the occurrence of am- 
biguousness is difficult to recognize. To overcome this "curse of compromise" , 
the next section proposes a method for solving expression (g directly if the 
density estimation is given in the form ([T]). 
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2 Finding Global Maxima for Kernel Density 
Estimations 



The fundamental idea of the here proposed method to find the global maxi- 
mum of a probability density function is to utilize its special properties. In 
general, the global maximum of an arbitrary function, which is, for example, 
given as a piece of code, can be found only by trial and error. In principle, 
this can be also applied to find the global maximum of a probability density 
function. But this "blind" search would be very inefficient and the remaining 
likelihood to find still better values does not become zero, regardless of how 
long the algorithm runs. 

But for probability densities h{y), this remaining likelihood can be re- 
duced very fast by using /i-distributed sample points, instead of evenly dis- 
tributed samples. Why is it so? To answer this question, we assume that q 
accordingly distributed sample points with i = 1, . . . ,q have been gener- 
ated. For each y^, there is a percentage for more improbable realizations 
y. Let ttj be 99%. The probability that all other q — 1 generated samples 
have lower a- values is (99%/100%)''~^. For q = 10000, this probability is 
only 2.27 10"^'*! In practice, this means that it is impossible not to come 
close to the global maximum with 10000 /i-distributed sample points. That 
is all. 

Fortunately, the generation of accordingly distributed samples is not very 
difficult for kernel density estimations like expression ([T|. In the first step, 
we insert for the calculation of expression ^ the given value x and get 

y = argmax {h{y)) (6) 
y 

with 

m 

Ky)■.= Y,hUy-y^) (7) 

i=l 

and 

(j)i{x — Xi) 

Y.j=i(t)Ax - Xj) 

Note that h fulfills the requirements for a probability density function. Fur- 
thermore, the bi can be interpreted as probabilitie^ because of Y^^=i — ^■ 
In the next step, we generate a dataset 

D' = {y[,...,y'^} (9) 

^Many hi are very low for a given value x. The corresponding kernels should be omitted 
to improve the computation speed. 
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of /i-distributed random samples. For this purpose, we can utilize the distri- 
bution function H of the density function h: 



m m 'i 

-oc -oc (10) 



i{y - Vi 



1=1 



The distribution functions \E'j for the kernels ipi are usually known or at least 
easy to calculate. The generation of the k = 1, . . . , q random samples ^ can 
be performed in three stages: 

1. Choose randomly one of the m kernels ipi corresponding to the proba- 
bilities 6j. 

2. Let j be the choice of the first stage. Generate now a ■j/'j-distributed 
random value using the distribution function "^j. 

3. Add the kernel center yj to the random value from stage two to get a 
random sample y'j^ 

After that, we calculate the function values h{y'f?) for all = 1, . . . , g 
of dataset ([9]). The argument y'^ for which h{y'j^) becomes maximal is then 
a good starting point for a local optimization method like gradient ascent 



Duda et al. 2000 , for example. 



3 Implementation Example 

The subsequent Matlab cod^ snippet implements the described method for 
multidimensional Gaussian kernels with diagonal covariance matrix. 

function [xm,pxm] = f indMax(para, q) 

m = length (para. a) ; 

d = length(para.x(l , : ) ) ; 

cdf = zeros (1 ,ni) ; 

for (i=2:in) cdf(i) = cdf(i-l) + para.a(i-l); end 



^The code was tested with Matlab 7.1. 
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xr = zeros(q,d); yr = zeros(q,l); 
for (i=l:q) 

rv = rand(l) ; Ivi = find(cdf < rv) ; ri = Ivi (length(lvi) ) ; 
xr(i,:) = randnd ,d) . *sqrt (para. s(ri) ) + para.x(ri); 
yr(i) = KDE(xr (i , : ) ,para) ; 

end 

[pxm,xi] = max(yr) ; xm = xr(xi,:); 

The parameters of the expression ([T]) are combined into the structure para 
with three elements: para. a are the weights 6j, para.s the standard devi- 
ations, and para.x the centers of the Gaussian kernels. The function KDE 
calculates the estimated density value for a given x: 

function y = KDE(x,para); 

m = length (para. a) ; y = 0; 
for (i=l:in) 

y = y + para. a(i) *gauss (x, para. x(i ,:) ,para. s(i ,:)) ; 

end 

function y = gauss (x,ni, s) 

y = prodd ./(sqrt(2*pi)*s)) . *exp(sum(-(x-ni) . ~2./(2*s . "2))) ; 

The gradient ascent is not performed in this example. 

Figure [2] shows the results of an experiment with function f indMax for 
different values of q. The parameters of the probability density function h{y) 
were 

para. a = [0.45,0.45,0. 1] ; 

para.x = [ [1 , 1] ; [-1 , -1] ; [-1 . 5 , 1 . 5] ] ; 

para.s = [[1,1] ; [1,1] ; [0.5,0.5]] ; 

That means that there are two global maxima - at approximately (1 1)^ and 
-(1 1)"^. For this reason, both could be the result returned by the algorithm. 
But only one of these possibilities is returned per step. The plots also show 
that the distribution of the computed points becomes more compact with 
increasing size of q. Figure [3] shows the result for a more complex density 
with 80 kernels and several local maxima. 



6 



-3 -2-10 1 2 3 -3 -2 -1 12 3 



Figure 2: Contour plots for a two-dimensional kernel density estimation with 
two global maxima at about (11)^ and —(11)-^. Both were found by the 
algorithm (black marks). For the left hand plot, q was 100 and for the right 
hand plot 1000. 

4 Regression Experiments 

This section investigates the properties of the described method. The first 
experiment compares the standard Nadaraya-Watson method and the pro- 
posed method with computation of the mode in view of its ability to estimate 
a clear functional dependency between x und y. For this purpose, a dataset 
of n = 1000 random sample points of the function y = sin(x5 ) in the interval 
[0, 2 vr] was generated. Furthermore, a slight, Gaussian distributed noise with 
a standard deviation of aN = 0.2 was added to the y-values. The dataset is 
shown on the left of Figure |4j 

Before applying the two regression methods, the distribution of the data 
has to be modeled by a kernel density. Different types of kernels can be 
applied. One of the simplest is the d-dimensional Gaussian kernel 

with s = {si . . . Sd)'^ as only free parameter. Its application to the two 
dimensional problem of Figure |4] yields 

1 " 

P{x) = -y^aix ~ Xi,s) (12) 

i=l 

with X = (xy)'^. For high-dimensional problems, the smoothness s has to be 
automatically optimized with respect to a certain quality measurement, such 



7 



Figure 3: The result for a more complex density with 80 kernels and several 
local maxima, q was 100 (left) and 1000 (right). 



if' ■. •;• 



1 2 3 4 5 6 




Figure 4: Two datasets. 



as the self-contribution Duin 1976 for example. Another method is plug-in 
estimation. A good overview about this topic is given by Turlach 1993 or 



Scott 1992 



For the two-dimensional dataset in Figure |4} it is still possible to esti- 
mate the smoothness parameter visually. For s = (0.10.1)^, the resulting 
density is drawn as contour plot in Figure [5] at the top. Furthermore, the 
picture provides the result of the Nadaraya- Watson regression (left) and of 
the proposed method (right) as white dotted lines. Every point represents 
the outcome for a single given value x. 

The picture demonstrates too that the Nadaraya- Watson regression per- 
forms clearly better. This is only at the first glance surprising. Formally, both 
regression methods should give the same results, because expected value and 
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Figure 5: The results of the Nadaraya- Watson method (left) and of the 
optimization method (right). The density estimation above shows a clear 
dependency between x and y - contrary to the other below. 

maximum are identical for the true conditional probability distribution 

PY\x{y\x) = g{y - sm{x^, a n). (13) 

But both regression techniques have different susceptibilities to estimation 
errors and the property of the effective value to average out noise leads to a 
much smoother curve progression. 

This advantage becomes a shortcoming if the dependencies within the 
data are ambiguous. To demonstrate this, a second dataset was generated 
(Figure |4| right). Now, for most values of x two values of y with different 
emphasis are reasonable. The Nadaraya- Watson regression calculates for 
every x a "compromise" . This can lead to very improbable values for y. The 
optimization method however chooses always the most probable value and is 
because of this immune to this effect. 
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5 Conclusion 



If a dependency between some data is clear and unambiguous, the standard 
Nadaraya- Watson method or - still better - the local linearizing Nadaraya- 



Watson approach Cleveland 1979 should be preferred for the modeling. 
But for numerous real life applications, it cannot be guaranteed that this 
condition is fulfilled because, for example, the data may be collected online. 
Another difficulty is a high dimensionality. The dependency may be simple 
and unambiguous between some of the vector components, but between oth- 
ers it may not. Every input-output combination has to be checked, what is 
mostly impracticable. 

For such general and complex cases, the proposed method is more suit- 
able, because the assumption of unambiguousness is not necessary. The ap- 
proach returns always a prediction that is probable in respect to the knowl- 
edge given by the sample data. For some applications, this property is more 
important than continuity of the curve and its smoothness. 

An example for such a scenario is machine control. The data are in this 
case measurements from actuators and sensors. The controller continuously 
has to solve the problem which actuator values leads to the desired sensor 
values. For this purpose, already one good setting is sufficient, regardless of 
whether several possibilities exist or not. An average value or a "compro- 
mise" , however, is mostly a bad decision. 



References 

W.S. Cleveland. Robust locally weighted regression and smoothing scatter- 
plots. Journal of the American Stastical Association, 1979. 

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. 
John Wiley & Sons, Inc., 2000. 

R.P.W. Duin. On the choice of the smoothing parameters for parzen estima- 
tors of probability density functions. IEEE Transactions on Computers, 
Vol. C-25, No. 11:1175-1179, 1976. 

E. A. Nadaraya. On estimating regression. Theory of Probability and its 
Applications, Vol. 9:141-142, 1964. 

David W. Scott. Multivariate Density Estimation. John Wiley & Sons, Inc., 
1992. 



10 



Berwin A. Turlach. Bandwidth selection in kernel density estimation: A 
review. In CORE and Institut de Statistique, 1993. 

G. S. Watson. Smooth regression analysis. Sankhya, Series A, 26:359-372, 
1964. 



11 



